Least Squares Linear Regression

In this lecture we will introduce linear models and least squares linear regression. We will explain the basic parametric model formulation and how to optimize it using linear algebra and geometry rather than gradient descent.

Imports

In this notebook we will be using the following packages:

In [1]:
import numpy as np
import pandas as pd
In [2]:
import plotly.offline as py
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import cufflinks as cf
cf.set_config_file(offline=True, sharing=False, theme='ggplot');
In [3]:
import torch
from sklearn.linear_model import LinearRegression
In [4]:
# Support embedding YouTube Videos in Notebooks
from IPython.display import YouTubeVideo

A Note on Lecture Format

This is the first of a series of lectures that are designed to be completed online rather than in class. This new lecture format is experimental and we ask for your feedback on Piazza.

The playlist for all the videos in this lecture can be found here:

Least Squares Linear Regression

Intended Use

We have designed this video notebook to be a series of short videos, each followed by simple questions (with answers provided) and demos. Note that all the solutions are provided and there are no grades for working through this notebook. However, we hope that combining videos with activities makes the material more engaging.

Recap

The following short video provides a recap of the modeling recipe (model, loss, optimize). If you are comfortable with this you can safely skip this video.

In [5]:
YouTubeVideo("u0YycwQ814c")
Out[5]:

Recap Question

What are the three stages of modeling used in this class?

Answer In this class, the recipe for modeling consists of defining the model, defining the loss, and optimizing the model with respect to the loss.

Multilinear Regression Setting

In the following video we introduce the multiple linear regression setting.

In [6]:
YouTubeVideo("21PN6a3s22I")
Out[6]:

The Diamonds Dataset

To motivate the multilinear regression setting we will work with the classic diamonds dataset.

In [7]:
data = pd.read_csv("data/diamonds.csv.zip", index_col=0)
data.head()
Out[7]:
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
5 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75

Each record in the dataset corresponds to a single diamond. The fields are:

  1. carat: The weight of the diamond.
  2. cut: The quality of the cut, an ordinal variable with values Fair, Good, Very Good, Premium, and Ideal.
  3. color: The color of the diamond, an ordinal variable with values from J (worst) to D (best).
  4. clarity: How obvious inclusions are within the diamond. This is an ordinal variable with values I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, and IF (best).
  5. depth: The height of the diamond, measured from the culet to the table, divided by its average girdle diameter.
  6. table: The width of the diamond's table expressed as a percentage of its average diameter.
  7. price: Price of the diamond in USD.
  8. x: Length in mm.
  9. y: Width in mm.
  10. z: Depth in mm.

We are interested in predicting the price of a diamond given its characteristics. Therefore, we would like to fit a model:

$$ f_\theta\left(\text{Diamond's Characteristics}\right) \rightarrow \text{Price} $$

Suppose for now we focus only on the diamond's carat, depth, and table values. In this case, we would want a model of the form:

$$ f_\theta\left(\textbf{carat}, \textbf{depth}, \textbf{table}\right) \rightarrow \textbf{price} $$

We could express this as a linear model of the form:

$$ f_\theta\left(\textbf{carat}, \textbf{depth}, \textbf{table}\right) = \theta_0 + \theta_1 * \textbf{carat} + \theta_2 * \textbf{depth} + \theta_3 * \textbf{table} $$

where $\theta_0$ is the intercept parameter and $\theta_1, \ldots, \theta_3$ are the "slopes" with respect to each of the covariates.

Focusing on just the carat, depth, and table features and the price our dataset looks like:

In [8]:
data[['carat', 'depth', 'table', 'price']]
Out[8]:
carat depth table price
1 0.23 61.5 55.0 326
2 0.21 59.8 61.0 326
3 0.23 56.9 65.0 327
4 0.29 62.4 58.0 334
5 0.31 63.3 58.0 335
... ... ... ... ...
53936 0.72 60.8 57.0 2757
53937 0.72 63.1 55.0 2757
53938 0.70 62.8 60.0 2757
53939 0.86 61.0 58.0 2757
53940 0.75 62.2 55.0 2757

53940 rows × 4 columns

In lecture, we saw that the predictions for the entire data set $\hat{\mathbb{Y}}$ can be computed as:

$$ \hat{\mathbb{Y}} = \mathbb{X} \theta $$

The covariate matrix $\mathbb{X} \in \mathbb{R}^{n \times (d+1)}$ has $n$ rows corresponding to each record in the dataset and $d+1$ columns corresponding to the original $d$ columns in the data plus an additional column of 1s. For example, the first five rows of $\mathbb{X}$ for this dataset would look like:

0 1 2 3
0 1.0 0.23 61.5 55.0
1 1.0 0.21 59.8 61.0
2 1.0 0.23 56.9 65.0
3 1.0 0.29 62.4 58.0
4 1.0 0.31 63.3 58.0

We can extract our covariate matrix and add a column of all 1s:

In [9]:
n,d = data[['carat', 'depth', 'table']].shape
print("n = ", n)
print("d = ", d)
n =  53940
d =  3
In [10]:
X = np.hstack([np.ones((n,1)), data[['carat', 'depth', 'table']].to_numpy()])
X
Out[10]:
array([[ 1.  ,  0.23, 61.5 , 55.  ],
       [ 1.  ,  0.21, 59.8 , 61.  ],
       [ 1.  ,  0.23, 56.9 , 65.  ],
       ...,
       [ 1.  ,  0.7 , 62.8 , 60.  ],
       [ 1.  ,  0.86, 61.  , 58.  ],
       [ 1.  ,  0.75, 62.2 , 55.  ]])

We can define the linear model in Python using numpy. Here I assume the input is in row-vector form, so this function works with either a single row or the entire matrix. Note that in lecture we discuss the case where a single row is stored in column-vector form (as is the convention in most math textbooks).

In [11]:
def linear_model(theta, xt): 
    return xt @ theta # The @ symbol is matrix multiply
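As a quick sanity check (a small sketch with made-up numbers, not our diamonds data), the row-vector form used here and the column-vector form used in most textbooks give the same prediction:

```python
import numpy as np

theta = np.array([[1.0], [2.0], [3.0], [4.0]])     # (d+1, 1) parameter column vector
x_col = np.array([[1.0], [0.23], [61.5], [55.0]])  # one observation as a column vector
x_row = x_col.T                                    # the same observation in row form

# x_row @ theta (row form) equals theta.T @ x_col (textbook column form)
assert np.isclose((x_row @ theta).item(), (theta.T @ x_col).item())
```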

If we guess a value for $\theta$ we can make a prediction using our model:

In [12]:
theta_guess = np.random.randn(d+1, 1)
theta_guess
Out[12]:
array([[-0.60597966],
       [ 0.96287929],
       [ 1.05907899],
       [ 0.79090442]])

Making a single prediction for the 3rd row:

In [13]:
linear_model(theta_guess, X[2,:])
Out[13]:
array([111.28586449])

Making a prediction for all the rows at once:

In [14]:
linear_model(theta_guess, X)
Out[14]:
array([[108.24858368],
       [111.17431831],
       [111.28586449],
       ...,
       [114.03246173],
       [110.69837139],
       [109.49063621]])

Copying the notation directly from the video lecture:

In [15]:
Y_hat = X @ theta_guess
Y_hat
Out[15]:
array([[108.24858368],
       [111.17431831],
       [111.28586449],
       ...,
       [114.03246173],
       [110.69837139],
       [109.49063621]])

Defining and Minimizing the Loss

The next video defines and minimizes the least squares loss by using a geometric argument.

In [16]:
YouTubeVideo("zkJ3CULN8Wk")
Out[16]:

The Least Squares Loss

The Least Squares Regression loss is simply the average squared loss.

$$ L(\theta) = \frac{1}{n}\sum_{i=1}^n \left( \mathbb{Y}_i - \mathbb{X}_i \theta \right)^2 $$

Here $\mathbb{Y}_i$ is the $i$th observed response (diamond price) and $\mathbb{X}_i$ is the row vector corresponding to the $i$th row of the covariates with an added 1 at the beginning.

We also need to define the response vector $\mathbb{Y}$:

In [17]:
Y =  data[['price']].to_numpy()
Y
Out[17]:
array([[ 326],
       [ 326],
       [ 327],
       ...,
       [2757],
       [2757],
       [2757]])

Directly writing the average squared loss:

In [18]:
def squared_loss(theta):
    return np.mean((Y - X @ theta)**2)
In [19]:
squared_loss(theta_guess)
Out[19]:
30516449.245282363

Using matrix notation:

In [20]:
def squared_loss(theta):
    return ((Y - X @ theta).T @ (Y - X @ theta)).item() / Y.shape[0]
In [21]:
squared_loss(theta_guess)
Out[21]:
30516449.24528237

Minimizing the Loss

In the video lecture we defined the normal equation as:

$$ \mathbb{X}^T \mathbb{X} \hat{\theta} = \mathbb{X}^T \mathbb{Y} $$

Solving this equation for $\hat{\theta}$ gives the minimizer of the squared loss with respect to our data.

If we assume $\mathbb{X}^T \mathbb{X}$ is invertible (full rank) then we get the following solution:

$$ \hat{\theta} = \left( \mathbb{X}^T \mathbb{X} \right)^{-1} \mathbb{X}^T \mathbb{Y} $$
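One quick way to see where the normal equation comes from, complementing the geometric argument in the video, is to set the gradient of the loss to zero:

$$ \nabla_\theta L(\theta) = -\frac{2}{n} \mathbb{X}^T \left( \mathbb{Y} - \mathbb{X} \theta \right) = 0 \quad \implies \quad \mathbb{X}^T \mathbb{X} \hat{\theta} = \mathbb{X}^T \mathbb{Y} $$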

Minimizing the Loss

The following video provides better intuition for the shape of each term in the solution to the normal equation, as well as how to solve it more effectively.

In [22]:
YouTubeVideo("9vWR-19WQKU")
Out[22]:

While it is not as numerically stable or efficient, we can compute $\hat{\theta}$ directly using matrix inversion:

In [23]:
from numpy.linalg import inv, solve
In [24]:
theta_hat = inv(X.T @ X) @ X.T @ Y
theta_hat
Out[24]:
array([[13003.4405242 ],
       [ 7858.77050994],
       [ -151.23634689],
       [ -104.47278016]])
In [25]:
theta_hat = solve(X.T @ X, X.T @ Y)
theta_hat
Out[25]:
array([[13003.4405242 ],
       [ 7858.77050994],
       [ -151.23634689],
       [ -104.47278016]])

Using solve is slightly more efficient since it avoids the explicit inversion and the subsequent matrix multiply. For matrices that are less well conditioned, solve can also be more numerically stable.
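As an aside, numpy also provides np.linalg.lstsq, which solves the least squares problem with an SVD-based routine and handles even rank-deficient matrices. A small sketch on made-up data (not the diamonds X and Y) showing it agrees with the normal-equation solve:

```python
import numpy as np

rng = np.random.default_rng(42)
A = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 3))])  # toy design matrix with a 1s column
b = rng.normal(size=(100, 1))                                  # toy response vector

theta_solve = np.linalg.solve(A.T @ A, A.T @ b)                        # normal-equation solve
theta_lstsq, residuals, rank, svals = np.linalg.lstsq(A, b, rcond=None)  # SVD-based least squares

assert np.allclose(theta_solve, theta_lstsq)
```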

Comparing a Random Guess and Our Solution

We can compare $\hat{\theta}$ to our earlier random guess. Notice that here I take the square root so the loss is in units of US dollars.

In [26]:
print("theta_hat", np.sqrt(squared_loss(theta_hat)))
theta_hat 1526.037612639195
In [27]:
print("theta_guess", np.sqrt(squared_loss(theta_guess)))
theta_guess 5524.169552546552

Our solution is substantially better than the random guess, though an average error of over $1,500 is still large.
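As a cross-check, sklearn's LinearRegression (imported above but not yet used) fits the same least squares solution, adding the intercept column internally. A sketch on synthetic data, since the exact variable names here are made up:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 3))  # toy features without a 1s column
y = 2.0 + X_raw @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.1, size=200)

model = LinearRegression()  # fit_intercept=True (the default) adds the intercept for us
model.fit(X_raw, y)

# Compare against the normal-equation solution with an explicit 1s column
Xmat = np.hstack([np.ones((200, 1)), X_raw])
theta = np.linalg.solve(Xmat.T @ Xmat, Xmat.T @ y)

assert np.isclose(model.intercept_, theta[0])
assert np.allclose(model.coef_, theta[1:])
```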

Making Predictions

Now that we have estimated the model parameters $\hat{\theta}$, we can predict the price $\hat{\mathbb{Y}}$ for each of our diamonds:

In [28]:
Y_hat = X @ theta_hat

Diagnosing the Model

In previous lectures we plotted the residuals. We can do that against each of the dimensions, but in general it is difficult to visualize the residuals of a higher-dimensional model directly.

In [29]:
fig = px.scatter(x=X[:, 1], y=(Y - Y_hat).flatten(), opacity=0.2)
fig.update_xaxes(title="Carat")
fig.update_yaxes(title="Residual")
fig
In [30]:
fig = px.scatter(x=X[:, 2], y=(Y - Y_hat).flatten(), opacity=0.2)
fig.update_xaxes(title="Depth")
fig.update_yaxes(title="Residual")
fig
In [31]:
fig = px.scatter(x=X[:, 3], y=(Y - Y_hat).flatten(), opacity=0.2)
fig.update_xaxes(title="Table")
fig.update_yaxes(title="Residual")
fig

A better way to visualize a higher-dimensional regression model is to plot the predicted values against the observed values:

In [32]:
fig = px.scatter(x = Y.flatten(), y = Y_hat.flatten(), opacity=0.2)
fig.add_trace(go.Scatter(x=[0,20000], y=[0, 20000], name="Y_hat=Y"))
fig.update_xaxes(title="Y")
fig.update_yaxes(title="Y_hat")
fig

What do we see? We are underestimating the price of expensive diamonds! Perhaps we should also consider the cut, color, and clarity.

What's Next

In the next notebook lecture we will see how to improve this model by applying feature engineering techniques to incorporate categorical data.